{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PyTorch: Using `ShardReader` to read WebDataset formatted Shards\n",
    "\n",
    "The `ShardReader` class can be used to read WebDataset formatted shards from buckets and objects through URLs or by passing in buckets directly. The `ShardReader` class will yield an iterator contain a tuple with the sample basename and a sample content dictionary. This dictionary is keyed by file extension (e.g \"png\") and has values containing the contents of the associated file in bytes. So, given a shard with a sample in it containing a \"cls\" and \"png\" file, you can read the shard using `ShardReader` and then access the sample and it's contents directly by iterating through the `ShardReader` instance."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Install necessary packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from aistore.sdk import Client\n",
    "from aistore.pytorch.shard_reader import AISShardReader"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Run an AIStore Cluster, either locally or elsewhere, and configure the endpoint and bucket which you want to use"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "AIS_ENDPOINT = \"http://localhost:8080\"\n",
    "AIS_PROVIDER = \"ais\"\n",
    "BCK_NAME = \"test-data\"\n",
    "\n",
    "client = Client(endpoint=AIS_ENDPOINT)\n",
    "bucket = client.bucket(BCK_NAME, AIS_PROVIDER).create(exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Populate the bucket with WebDataset formatted shards using the AIS CLI\n",
    "\n",
    "To download the entire set:\n",
    "\n",
    "```console\n",
    "ais start download \"https://storage.googleapis.com/webdataset/testdata/publaynet-train-{000000..000009}.tar\" ais://test-data\n",
    "```\n",
    "\n",
    "You can use your own data here as well. Just ensure that your bucket has shards that are formatted in WebDataset format."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create a ShardReader and use it to read your bucket "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "shard_reader = AISShardReader(bucket_list=bucket)\n",
    "\n",
    "# Note that the webdataset format stores multiple files to one dataset indexed by basename\n",
    "for basename, content_dict in shard_reader:\n",
    "    print(basename, list(content_dict.keys()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### You can also use a `DataLoader` if you would like"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import DataLoader\n",
    "\n",
    "loader = DataLoader(shard_reader, batch_size=60, num_workers=4)\n",
    "\n",
    "# basenames, content_dicts have size batch_size each\n",
    "for basenames, content_dicts in loader:\n",
    "    print(basename, list(content_dict.keys()))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ais",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
